The First Tour of the IPython Notebook

Why IPython Notebook?

I still remember the first time I saw IPython Notebook: it was at Taipei.py. At the time, I wasn't sure why it would be a good idea to write Python programs in a restricted environment inside a browser. I mean, I couldn't even use my favorite Vim commands! I tried installing it and wrote a few short programs, but in the end I wasn't interested enough to keep using the Notebook. On the other hand, I found the IPython interactive shell more convenient than the plain Python shell, so I used it more and more often whenever I needed to issue short Python commands.

My second encounter with IPython Notebook came through the course materials of CS231n. In that course, each assignment was an IPython Notebook. You could complete the code either in the Notebook itself or in separate Python scripts, and the results could be evaluated immediately in the Notebook. I realized this was a fantastic way to share and communicate! There was also nbviewer, which made it easy to view all those Notebooks without having to set up a Python environment.

As I gained more experience with machine learning tasks in Python, I started to understand why IPython Notebook was so popular in the scientific computing community. I believe one of the reasons is that it provides a very simple way to record everything you do.

For example, when doing data science you need to manage not just the source code but also the data. Tasks such as data preprocessing, data cleaning, and feature extraction all require transformations of the data. Oftentimes a transformation only needs to be done once, and it is very tempting to just issue the command without recording what was done. This can become a disaster when you later want to rerun the experiments with different settings for the early stages of the pipeline. Even if you do put the commands into Python scripts, it is still difficult to figure out the order and parameters for those scripts later. On the other hand, if you write a single script that performs every data transformation and analysis straight from the original data on every run, the running time may become unacceptable when dealing with big data. That's where IPython Notebook shines: it is the perfect notebook to record everything.

Data Analysis with IPython Notebook

So let's get started with our tour of IPython Notebook. Some simple tasks will be demonstrated using several libraries including:

  1. mpld3
  2. matplotlib
  3. NumPy
  4. pandas
  5. scikit-learn
  6. wordcloud
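
If these libraries are not installed yet, they can usually be pulled in from PyPI; here is a minimal sketch (the package names are as published on PyPI, and you may prefer to do this inside a virtualenv):

pip3 install mpld3 matplotlib numpy pandas scikit-learn wordcloud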

In particular, matplotlib is a powerful plotting package for data visualization, and its close integration with IPython Notebook makes it even more useful.

First, we use the %matplotlib inline magic command so that matplotlib displays plots directly inside the Notebook.


In [1]:
%matplotlib inline

Showing the Most Important Features

To build intuition about a model, finding the features with the largest weights is often helpful. We will use the polarity dataset for the demonstration:


In [2]:
! wget http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
! tar xzf review_polarity.tar.gz


--2015-12-26 16:14:15--  http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 128.84.154.137
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|128.84.154.137|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3127238 (3.0M) [application/x-gzip]
Saving to: ‘review_polarity.tar.gz’

100%[======================================>] 3,127,238    655KB/s   in 5.6s   

2015-12-26 16:14:21 (543 KB/s) - ‘review_polarity.tar.gz’ saved [3127238/3127238]

First, load the required modules:


In [3]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

We use TfidfVectorizer to get the TF-IDF feature vector for each review:


In [4]:
sent_data = load_files('txt_sentoken')

tfidf_vec = TfidfVectorizer()

sent_X = tfidf_vec.fit_transform(sent_data.data)
sent_y = sent_data.target

LinearSVC is used to train a classifier for positive and negative sentiments.


In [5]:
lsvc = LinearSVC()
lsvc.fit(sent_X, sent_y)


Out[5]:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
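
Before looking at the weights, it can be reassuring to check how well the classifier actually performs. The quick cross-validation check below is not part of the original notebook; it is only a sketch, and the import path assumes the same scikit-learn generation used elsewhere in this post (newer releases expose cross_val_score under sklearn.model_selection instead):

from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases

# 5-fold cross-validated accuracy on the same TF-IDF features
scores = cross_val_score(LinearSVC(), sent_X, sent_y, cv=5)
print('mean accuracy: {:.3f}'.format(scores.mean()))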

Finally, we show the most important features learned by the classifier.


In [6]:
def display_top_features(weights, names, top_n):
    # sort the features by the absolute value of their weights and keep the top_n
    top_features = sorted(zip(weights, names), key=lambda x: abs(x[0]), reverse=True)[:top_n]
    top_weights = [x[0] for x in top_features]
    top_names = [x[1] for x in top_features]
    
    fig, ax = plt.subplots(figsize=(16,8))
    ind = np.arange(top_n)
    bars = ax.bar(ind, top_weights, color='blue', edgecolor='black')
    # negatively weighted features are shown in red
    for bar, w in zip(bars, top_weights):
        if w < 0:
            bar.set_facecolor('red')
    
    # place the tick labels roughly under the center of each bar
    width = 0.30
    ax.set_xticks(ind + width)
    ax.set_xticklabels(top_names, rotation=45, fontsize=12)
    
    plt.show(fig)

display_top_features(lsvc.coef_[0], tfidf_vec.get_feature_names(), 20)


Word clouds are also an interesting way to show the relative importance of different words:


In [7]:
from wordcloud import WordCloud

In [8]:
def generate_word_cloud(weights, names):
    return WordCloud(width=350, height=250).generate_from_frequencies(zip(names, weights))

def display_word_cloud(weights, names):
    fig, ax = plt.subplots(1, 2, figsize=(28, 10))
    
    pos_weights = weights[weights > 0]
    pos_names = np.array(names)[weights > 0]
    
    neg_weights = np.abs(weights[weights < 0])
    neg_names = np.array(names)[weights < 0]
    
    lst = [('Positive', pos_weights, pos_names), ('Negative', neg_weights, neg_names)]
    
    for i, (label, weights, names) in enumerate(lst):
        wc = generate_word_cloud(weights, names)
        ax[i].imshow(wc)
        ax[i].set_axis_off()
        ax[i].set_title('{} words'.format(label), fontsize=24)
    
    plt.show(fig)

display_word_cloud(lsvc.coef_[0], tfidf_vec.get_feature_names())


Visualization with Dimensionality Reduction

It's often difficult to make sense of high-dimensional data, so dimensionality reduction is commonly used to help with visualization. Here we will use t-SNE on the Iris flower data set. Additionally, we use mpld3 to produce figures that can be zoomed and panned interactively.


In [9]:
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import mpld3

In [10]:
iris = load_iris()

In [11]:
def display_iris(data):
    X_tsne = TSNE(n_components=2, perplexity=20, learning_rate=50).fit_transform(data.data)
    
    fig, ax = plt.subplots(1, 2, figsize=(10, 5))
    ax[0].scatter(X_tsne[:, 0], X_tsne[:, 1])
    ax[0].set_title('All instances', fontsize=14)
    ax[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=data.target)
    ax[1].set_title('All instances labeled with color', fontsize=14)
    
    return mpld3.display(fig)

display_iris(iris)


Out[11]:

As we can see, t-SNE does quite well at separating data points of different classes even without knowing the labels. Let's try a more complicated example with the MNIST dataset of handwritten digits. We will also use PointLabelTooltip to display the labels as tooltips.


In [12]:
from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA

In [13]:
mnist = fetch_mldata('MNIST original')

In [14]:
def display_mnist(data, n_samples):
    X, y = data.data / 255.0, data.target
    
    # downsample as the scikit-learn implementation of t-SNE is unable to handle too much data
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    X_train, y_train = X[indices[:n_samples]], y[indices[:n_samples]]
    
    
    X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X_train)
    X_pca = PCA(n_components=2).fit_transform(X_train)
    
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))


    points = ax[0].scatter(X_tsne[:,0], X_tsne[:,1], c=y_train)
    tooltip = mpld3.plugins.PointLabelTooltip(points, labels=y_train.tolist())
    mpld3.plugins.connect(fig, tooltip)
    ax[0].set_title('t-SNE')
    
    points = ax[1].scatter(X_pca[:,0], X_pca[:,1], c=y_train)
    tooltip = mpld3.plugins.PointLabelTooltip(points, labels=y_train.tolist())
    mpld3.plugins.connect(fig, tooltip)
    ax[1].set_title('PCA')
    
    
    return mpld3.display(fig)

display_mnist(mnist, 1000)


Out[14]:

If your aim is to learn a projection when labels are available for the training data, LDA can also be used.


In [15]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.lda import LDA

In [16]:
def display_mnist_3d(data, n_samples):
    X, y = data.data / 255.0, data.target
    
    # downsample to keep the example fast and the 3D scatter plot responsive
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    X_train, y_train = X[indices[:n_samples]], y[indices[:n_samples]]
    
    X_lda = LDA(n_components=3).fit_transform(X_train, y_train)
    
    
    fig, ax = plt.subplots(figsize=(10,10), subplot_kw={'projection':'3d'})
    
    points = ax.scatter(X_lda[:,0], X_lda[:,1], X_lda[:,2] , c=y_train)
    ax.set_title('LDA')
    ax.set_xlim((-6, 6))
    ax.set_ylim((-6, 6))
    
    
    plt.show(fig)
    
display_mnist_3d(mnist, 1000)


Data exploration with Pandas

Pandas is quite useful for data analysis. Let's use the Meta Kaggle dataset to see how users are doing on the Kaggle website.


In [17]:
import pandas as pd
import sqlite3

After manually downloading the dataset, we extract the zipped file. There should be an output directory containing the files.
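
For reference, the extraction can be done from a shell like this (the archive name below is just a guess; use whatever filename Kaggle actually gives you):

unzip meta-kaggle.zip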


In [18]:
con = sqlite3.connect('output/database.sqlite')
kaggle_df = pd.read_sql_query('''
SELECT * FROM Submissions''', con)

Display some entries:


In [19]:
kaggle_df.head()


Out[19]:
     Id  SubmittedUserId        DateSubmitted  TeamId  PrivateScore  PublicScore  IsSelected  ScoreStatus  IsAfterDeadline  DateScored  ScoringDurationMilliseconds
0  2180              647  2010-04-29 22:32:08     496       56.2139      55.7692       False            1            False
1  2181              619  2010-04-30 09:38:29     497            50      47.1154       False            1            False
2  2182              619  2010-04-30 09:48:50     497       65.6069      61.0577       False            1            False
3  2184              663  2010-05-01 11:02:52     499            50      47.1154       False            1            False
4  2185              673  2010-05-02 08:04:38     500       62.2832      61.0577       False            1            False

Now, we would like to analyze the submission times. First, we obtain the day of the week and the hour of the week for each submission.


In [20]:
print('There are {} submissions'.format(kaggle_df.shape[0]))

# convert time strings to DatetimeIndex
kaggle_df['timestamp'] = pd.to_datetime(kaggle_df['DateSubmitted'])

print('The earliest and latest submissions are on {} and {}'.format(kaggle_df['timestamp'].min(), kaggle_df['timestamp'].max()))

kaggle_df['weekday'] = kaggle_df['timestamp'].dt.weekday
kaggle_df['weekhr'] = kaggle_df['weekday'] * 24 + kaggle_df['timestamp'].dt.hour


There are 934345 submissions
The earliest and latest submissions are on 2010-04-29 22:32:08 and 2015-08-31 23:58:44.050000

In [21]:
import calendar

In [22]:
def display_kaggle(df):
    fig, ax = plt.subplots(1, 2, figsize=(16, 8))
    
    # bar chart: number of submissions per day of the week (index relabeled to day names)
    ax[0].set_title('submissions per weekday')
    df['weekday'].value_counts().sort_index().rename_axis(lambda x: calendar.day_name[x]).plot.bar(ax=ax[0])
    
    # line chart: number of submissions per hour of the week (0-167)
    ax[1].set_title('submissions per hour of week')
    ax[1].set_xticks(np.linspace(0, 24*7, 8))
    df['weekhr'].value_counts().sort_index().plot(color='red', ax=ax[1])
    plt.show(fig)
    
display_kaggle(kaggle_df)


Next, we try to cluster the users based on their submission patterns to see whether different groups might like to submit at different times.


In [23]:
from collections import defaultdict
from sklearn.cluster import KMeans

In [24]:
def display_hr(df, n_clusters):
    # count each user's submissions per hour of the week,
    # then normalize so that every user's counts sum to one
    hrs_per_user = df[['SubmittedUserId', 'weekhr', 'Id']].groupby(['SubmittedUserId', 'weekhr']).count()
    total_per_user = hrs_per_user.sum(axis=0, level=0)
    user_patterns = (hrs_per_user / total_per_user)['Id']
    
    # build a dense vector of length 24*7 = 168 for each user
    vectors = defaultdict(lambda: np.zeros(24*7))
    for (u, hr), r in user_patterns.items():
        vectors[u][hr] = r
    X_hr = np.array(list(vectors.values()))
    
    # cluster the submission patterns and plot the mean pattern of each cluster
    y = KMeans(n_clusters=n_clusters, random_state=3).fit_predict(X_hr)
    
    for i in range(n_clusters):
        fig, ax = plt.subplots(figsize=(6, 6))
        indices = y == i
        X = X_hr[indices]
        ax.plot(np.arange(24*7), X.mean(axis=0))
        ax.set_xticks(np.linspace(0, 24*7, 8))
        ax.set_xlim((0, 24*7))
        ax.set_title('Cluster #{}, n = {}'.format(i, len(X)), fontsize=14)
        plt.show(fig)
        
display_hr(kaggle_df, 9)


It seems that the users from Cluster #1 and Cluster #8 might indeed be active at different times. What do you think?

XKCD

Finally, let's draw an XKCD-style plot with matplotlib! To make it render properly, we need to install the Humor Sans font and clear the font cache directory. To get the path of the cache, use:

import matplotlib
matplotlib.get_cachedir()
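
A minimal sketch of clearing the cache from a shell, assuming the path reported by get_cachedir() is the one you want to delete:

rm -rf $(python3 -c "import matplotlib; print(matplotlib.get_cachedir())")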

As we use Python 3, additional packages are also required:

sudo apt-get install libffi-dev
pip3 install cairocffi

In [25]:
def xkcd():
    with plt.xkcd():
        fig, ax = plt.subplots()
        ax.spines['right'].set_color('none')
        ax.spines['top'].set_color('none')
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_ylim([-1, 10])
        
        # productivity curve: slow ramp-up, frantic spike before the deadline, crash afterwards
        data = np.zeros(100)
        data[:60] += np.linspace(-1, 0, 60)
        data[60:75] += np.arange(15)
        data[75:] -= np.ones(25)
        
        ax.annotate(
            'DEADLINE',
            xy=(71, 7), arrowprops=dict(arrowstyle='->'), xytext=(30, 2))
        
        ax.plot(data)
        # vertical red line marking the deadline
        ax.plot([72, 72], [-1, 15], color='red')
        
        ax.set_xlabel('time')
        ax.set_ylabel('productivity')
        ax.set_title('productivity under a deadline')
        
        plt.show(fig)
xkcd()

